Perceiving Tempo in Incongruent Audiovisual Presentations of Human Motion: Evidence for a Visual Driving Effect

Compared to vision, audition has been considered to be the dominant sensory modality for temporal processing. Nevertheless, recent research suggests the opposite, such that the apparent inferiority of visual information in tempo judgements might be due to the lack of ecological validity of experimental stimuli, and reliable visual movements may have the potential to alter the temporal location of perceived auditory inputs. To explore the role of audition and vision in overall time perception, audiovisual stimuli with various degrees of temporal congruence were developed in the current study. We investigated which sensory modality weighs more in holistic tempo judgements with conflicting audiovisual information, and whether biological motion (point-light displays of dancers) rather than auditory cues (rhythmic beats) dominate judgements of tempo. A bisection experiment found that participants relied more on visual tempo compared to auditory tempo in overall tempo judgements. For fast tempi (150 to 180 BPM), participants judged ‘fast’ significantly more often with visual cues regardless of the auditory tempo, whereas for slow tempi (60 to 90 BPM), they did so significantly less often. Our results support the notion that visual stimuli with higher ecological validity have the potential to drive up or down the holistic perception of tempo.

Introduction
Perceiving inconsistent audiovisual information is common in daily life. In many cases, conflicting inputs of one modality are able to alter the percept of another. The McGurk effect (McGurk & MacDonald, 1976) is a famous example in which lip movements that do not correspond to the speech alter the perceived sounds. Different pianists' performances coupled with the same soundtrack have been perceived to be different in a number of dimensions (Behne & Wöllner, 2011). It is of interest whether similar observations can be extended to the perception of timing and tempo. The question of how audiovisual asynchrony affects temporal processing has also attracted much attention. The dominant role of audition has long been recognised in temporal processing in the sense that it provides higher accuracy and precision than vision (Grondin, 2010). However, this view is challenged by emerging evidence of the superiority of meaningful visual movements in time perception with both abstract (Grahn, 2012) and real-life stimuli (Hove & Keller, 2010; London et al., 2016). Thus, it remains controversial whether audition or vision dominates our perception of tempo.
There has been a long debate in research about whether timing is based on a central or on a distributed system (Occelli et al., 2011; Penney, 2003; Van Wassenhove et al., 2008; for an overview, see Wang & Wöllner, 2019). Some argue that the timing mechanism is distributed, which can explain the discrepancies in timing performance between, for instance, vision and audition (Grondin et al., 2008), while others favour the notion that the modality difference in time perception comes from the interaction between different sensory modalities in a central timing system (e.g., Levitan et al., 2015). The distributed account is supported by findings that audition has an advantage over vision and other sensory modalities in terms of duration discrimination (Grondin et al., 2008), reproduction (Gamache & Grondin, 2010), and estimation (Kanai et al., 2011). However, it should be noted that the dominance of audition in temporal processing does not apply to all cases, particularly with biological trajectories or movements (Allingham et al., 2020).
Evidence of temporal entrainment, which refers to the synchronisation between two rhythms, has been observed with musical, visual (Iversen et al., 2015) as well as tactile stimuli (Occelli et al., 2011). For tempo judgements, the temporal ventriloquism effect (Burr et al., 2009) and the auditory driving effect (Shipley, 1964) have both suggested the dominance of audition over visual displays by 'dragging' the temporal location of the latter to that of the former. Even when the auditory stimuli were not attended to, or were reduced in salience, duration judgements clearly leaned towards that of the perceived tone rather than that of the visual circle, suggesting that the processing of auditory temporal cues was possibly autonomous and occupied minimal cognitive resources (Ortega et al., 2014). Alternatively, the reliability of the sensory inputs was crucial when perceiving durations: whichever channel provided the least noise was assigned the most weight in duration estimation (Hartcher-O'Brien et al., 2014; Shi et al., 2010). It is therefore likely that auditory temporal inputs were more reliable in studies where audition outweighed vision, given that multisensory temporal information is integrated in an optimal fashion (Shi et al., 2013).
Relatively few studies have explored tempo judgement in the context of audiovisual stimuli with high ecological validity. Among those that adopted naturalistic visual stimuli, a strong influence of the visual over the auditory inputs has been found. For example, videos of musicians playing long notes on the marimba, coupled with long and short corresponding sounds, shifted the perceived note length towards 'long' (Schutz & Lipscomb, 2007). A study of human point-light displays in unimodal (auditory or visual) and bimodal (audiovisual) stimuli of varying musical tempi indicated that movements of high energy led to a faster perceived tempo of auditory stimuli (London et al., 2016). However, it remains unclear how tempo is holistically perceived when participants are asked to judge it based on both sensory information channels. Furthermore, to our knowledge, no study has investigated a combination of audition and vision in a controlled variation of audiovisually inconsistent stimuli.
Evidence has suggested that vision might not always be less accurate than audition in duration and tempo judgements. Past research has held that vision dominates spatial rather than temporal localisation (e.g., Burr et al., 2009; Repp, 2003), and has a lower temporal sensitivity than audition in low-level information processing (Ortega et al., 2014). However, it should be noted that the evidence for the high precision of audition comes from simple and controlled laboratory stimuli and the way they were arranged in the task, such as simple visual flickers (Shipley, 1964; Treisman et al., 1990), coloured squares (Grahn et al., 2011), and looming or receding discs or dots (Van Wassenhove et al., 2008), rather than from naturalistic stimuli. Naturalistic stimuli, such as biological motion derived from human behaviours and social activities (Boltz, 2005), often yield movements with high compatibility (Hove et al., 2010). The point-light technique for human motion was first used in experimental research by Johansson (1973). Point-light displays (PLD) of biological motion are derived from natural human (or animal) movements such that visual details including facial expressions or clothes are not shown, yet the naturalness in terms of movement kinematics is preserved. Abstract visual stimuli such as flashes, on the other hand, provide fewer cues, no biological movement trajectories, and thus a less rich temporal structure. An earlier attempt by Boltz (2005) compared the effects of naturalistic scenes when they were presented in either auditory, visual, or audiovisual channels. Results suggested that modality differences were not observed in terms of duration reproduction or estimation accuracy.
This led to a string of studies that adopted the naturalistic approach, in other words using visual movements that were common in everyday life to achieve the same salience and temporal discriminability as the auditory stimuli presented in past research (e.g., Grahn, 2012; Hove et al., 2013; London et al., 2016). Extraction of rhythmic patterns from visual movements was not only viable, but also independent from auditory interference, indicating a robust temporal representation in the visual module (Su & Salazar-López, 2016).
The above evidence calls for further exploration in multisensory timing. The current study intends to advance research as follows: (1) Provide a scenario where auditory and visual information is equally important to tempo judgements. (2) Explore the potential interaction of audition and vision in tempo judgement. In light of this, the current study examined the effects of competing auditory and visual information on tempo judgement with biological motion and drumbeats, taking into account the ecological validity of both. We hypothesised that, with meaningful visual information such as point-light displays (PLDs), the visual tempo relative to auditory tempo would contribute more to the overall tempo judgement. A given unit change in visual tempo should thus lead to larger changes in the tempo judgement ratio (fast/slow). Accordingly, more participants should rely on visual rather than auditory information when judging tempo. Finally, we hypothesised that perceived naturalness would decrease as the audiovisual tempo discrepancy increases.

Participants
Twenty-four participants were recruited for the study (12 female; aged M = 24.21 years, SD = 4.68). Participants had a mean of 10.04 years (SD = 7.09) of regular practice with musical instruments (including voice), and a mean of 7.65 years (SD = 7.18) of lessons on their instrument. Thus, the current sample represents a population that has moderate to advanced musical training. The sample size had been calculated a priori for a 3 × 3 design (α = 0.05, Cohen's f = 0.25, power = 0.8), requiring at least 15 participants (using G*Power; Faul et al., 2009). For a conservative approach, we recruited 24 participants. We also followed the guidelines of the Ethics Committee of the Faculty of Humanities, Universität Hamburg, and each participant was compensated 10 Euro for taking part.

Material
Participants were presented with audiovisual stimuli synthesised from isochronous drumbeats of nine tempi (60, 75, 90, 105, 120, 135, 150, 165, 180 beats per minute [BPM]), and visuals of the same tempo spectrum. The visual stimulus showed a PLD of a person jumping from left to right with the hands moving up and down (Fig. 1). This movement pattern was recorded with an eleven-camera motion-capture system (Qualisys Oqus, Qualisys AB, Göteborg, Sweden) at a frame rate of 200 frames per second. Thirty-one markers were attached to the performer. The movement pattern was intended to be directed neither towards an action-based nor towards a habitual (highly automatised) outcome, in order to avoid familiarity with the movement (Calvo-Merino et al., 2006). The movement was originally recorded at a speed of 120 BPM. The motion was presented from a 30-degree angle, in frontal view. The MATLAB Motion Capture (MoCap) Toolbox (Burger & Toiviainen, 2013) was adopted to speed up and slow down the PLD to the eight further tempi as specified above, while ensuring that the visual resolution and the number of data points per second were unchanged. The auditory stimuli of nine tempi (60 to 180 BPM, 15 BPM per step), on the other hand, were directly synthesised from a bass drum on the online drumbeat generator Drumbit (https://drumbit.app). Drumbeats can be found in real-life scenarios such as listening to techno music, thus providing higher naturalness than abstract auditory stimuli adopted in past studies such as sine waves. The PLDs and the drumbeat soundtracks of all tempi were then combined in Adobe Premiere Pro CC 2017 (Adobe Systems, San Jose, CA, USA) to create a total of 81 stimuli with all audiovisual tempo combinations. That is, all stimuli were bimodal videos varying in tempi.
The experiment was conducted in the SloMo laboratory at Universität Hamburg on a Dell U2414Hb monitor (Dell Technologies Inc., Round Rock, TX, USA), controlled by the software OpenSesame (Mathôt et al., 2012). A Sennheiser HD600 headphone set (Sennheiser GmbH, Hanover, Germany) was provided for the soundtrack. Participants responded to the experimental task by pressing the leftward or rightward button on the keyboard.

Design and Procedure
The current study introduced a 3 × 3 design where three auditory tempo ranges (slow: 60, 75, 90 BPM; medium: 105, 120, 135 BPM; fast: 150, 165, 180 BPM) and three visual tempo ranges of the same spectrum acted as the independent variables, while taking the corresponding tempo judgement as the dependent variable. We chose temporal bisection, a two-alternative forced-choice task (2AFC), to examine the proportion of 'fast' judgements at different tempo and modality conditions. The 2AFC method has been used in various studies of audiovisual integration (Chen et al., 2018; Gori et al., 2012; Shi et al., 2010). The bisection task has been widely used in cue combination research for both spatial stimuli and time in audiovisual integration processes (Gori et al., 2012). In an audiovisual Ternus apparent motion study, Shi et al. (2010) used the bisection task to measure audiovisual duration integration. Roach et al. (2006) adopted 2AFC temporal discrimination (higher or lower than 10 Hz) to estimate the threshold for audiovisual temporal integration. Similar to other direct measures of duration or tempi, such as reproduction tasks, the bisection task is able to probe audiovisual temporal integration as well as decisions in tempo judgements. One benefit of using the bisection task, compared to direct tempo reproduction or other motor-related tasks, is that the task is not influenced by motor noise. In a similar manner, here we applied the temporal bisection point to measure holistic tempo judgements, that is, whether observers shifted their judgements towards a fast or a slow tempo.
In the current study, participants were first presented with two audiovisual anchors (a fast tempo and a slow tempo) at the beginning of the experiment and were asked to judge holistically whether the tempo of a given stimulus was closer to the fast or the slow tempo. In other words, they should focus on both auditory and visual information in the video stimuli. The slow anchor was a bimodal video with an audiovisually consistent tempo at 60 BPM, and the fast anchor was one at 180 BPM. They were then shown randomised trials of 81 bimodal videos generated from nine auditory stimuli (60, 75, 90, 105, 120, 135, 150, 165, 180 BPM) and nine visual stimuli of the same tempo spectrum, repeated three times. Each auditory tempo was combined with each visual tempo. The bimodal stimuli included both tempo-consistent and tempo-inconsistent presentations. A total of 243 trials were presented to each participant.
Participants were seated in a quiet room approximately 80 cm in front of the monitor. Instructions were given by an experimenter who was trained to follow fixed protocols to ensure a standardised procedure. Each trial started with a fixation point for 100 ms, followed by a PLD presentation of 5 s while drum sounds were simultaneously played through the headphones. After the presentation, a '?' was shown in the centre of the screen, prompting participants to judge if the tempo of the presented stimulus was closer to the slow or the fast anchor tempo. They were asked to press the leftward arrow key for the slow and the rightward arrow key for the fast anchor. To refresh participants' memory of the two anchors, a text reminder 'anchors' popped up after every nine trials, and the anchors were played once each time. Participants pressed any key to proceed and to watch the fast and slow videos, with no time limit imposed. They were not required to make any response. In addition, an optional short break was offered every 40 or 41 trials. After completing the bisection task, participants were asked to rate the naturalness ('how natural does this video feel?') of all 81 conditions in randomised order. A trial started with a fixation point for 100 ms, followed by a 5-s bimodal video stimulus. A visual instruction of the naturalness question 'Please rate how natural the video feels' was presented. On a horizontal gauge bar from 1 (marked as 'least natural') to 100 (marked as 'most natural'), participants placed the cursor in a relative position to give a response.
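The trial structure described above (nine auditory tempi crossed with nine visual tempi, each combination shown three times) can be sketched as follows. This is a minimal illustration rather than the authors' OpenSesame script: the function name `build_trials`, the fixed seed, and shuffling across the whole trial list (rather than within blocks) are assumptions.

```python
import itertools
import random

# Tempo spectrum shared by the auditory (drumbeat) and visual (PLD) streams.
TEMPI_BPM = [60, 75, 90, 105, 120, 135, 150, 165, 180]
REPETITIONS = 3

def build_trials(seed=0):
    """Return a shuffled list of (auditory_bpm, visual_bpm) trials.

    Hypothetical helper: the seed and the whole-list shuffle are
    illustrative choices, not taken from the study.
    """
    combos = list(itertools.product(TEMPI_BPM, TEMPI_BPM))  # 9 x 9 = 81 stimuli
    trials = combos * REPETITIONS                           # 81 x 3 = 243 trials
    rng = random.Random(seed)
    rng.shuffle(trials)
    return trials

trials = build_trials()
# Tempo-consistent trials are those where both streams share one tempo.
n_consistent = sum(1 for a, v in trials if a == v)
print(len(trials), len(set(trials)), n_consistent)
```

Under these assumptions, nine of the 81 stimuli are tempo-consistent, so 27 of the 243 trials carry no audiovisual conflict.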

Data Analyses
All statistical analyses were conducted with R (Version 3.5.3; R Core Team, 2019). Given that the distributions of individual participants' tempo judgements were heavily skewed, we used nonparametric analyses, more specifically a series of chi-square tests, to compare differences between the numbers of 'fast' versus 'slow' judgements for auditory or visual tempo conditions. In addition, we fitted the response ('fast' versus 'slow') as a logistic function of the auditory and visual tempi to obtain a 2D psychometric function, such that we can obtain the points of subjective equality (PSEs). A separation boundary (auditory and visual tempi) by the PSE, yielded by the logistic model when the likelihood of 'fast' judgement was 0.5, was then estimated on the individual and group levels. Pearson's correlations were conducted to explore the relationship between perceived naturalness and audiovisual discrepancy.

Visual Versus Auditory Tempo
We first examined the effect of modality on tempo judgement at each tempo condition. To do so, we grouped the presentations either by (a) tempo of drumbeats or (b) tempo of visual PLD, into slow (60, 75, and 90 BPM), medium (105, 120, and 135 BPM), and fast (150, 165, and 180 BPM) tempo ranges. Figure 2 shows the mean proportion of 'fast' responses as a function of tempo, separated by modality. For fast stimuli, a chi-square test shows that participants were more likely to judge 'fast' with visual rather than auditory cues, χ²(1, N = 3878) = 81.96, p < 0.001, with a small to medium effect size (φ = 0.15). Correspondingly, for slow stimuli the proportion of 'fast' responses was significantly higher when participants relied on auditory cues than visual ones, χ²(1, N = 3907) = 59.16, p < 0.001. The effect size was also small to medium (φ = 0.12), according to Cohen (1988). There was no significant difference between modalities when the stimuli were presented at intermediate tempo, χ²(1, N = 3879) = 0.07, p = 0.80, φ = 0.004. These results suggest that visual information plays a more important role than auditory information in tempo judgement at both ends of the tempo spectrum: When the PLDs were shown at a fast (150, 165, 180 BPM) or a slow (60, 75, 90 BPM) tempo, participants judged stimuli overall to be fast or slow, regardless of the auditory tempo.
To get a detailed picture of the individual contributions of auditory and visual tempo to temporal judgements, we plotted the average response heatmap in Fig. 3. Both auditory and visual tempo contribute to judgements. In general, the faster the tempo, the more likely a participant would judge 'fast'. Consistent with the analysis shown above, the change of response is more sensitive in the 'vision' direction than in the 'audition' direction, as evinced by the response contours changing along the visual rather than the auditory dimension. To further quantify this, we applied a two-dimensional logistic regression, which is an extension of the one-dimensional psychometric function. We assume participants' bisection responses follow a logistic function of the auditory and visual tempi. The tempi in both conditions were first standardised by dividing each value by the median (120 BPM). The logistic model suggested a significant relationship between tempo judgement and the auditory as well as visual tempo, χ²(5827) = 2757.21, p < 0.001. McFadden's R² = 0.34, which fell between 0.2 and 0.4, indicating a good fit. The estimated coefficients are shown in Table 1. The coefficients for the visual and auditory tempi were β_V = 5.40 and β_A = 3.53, respectively. These reflect the sensitivity of the responses to tempo changes (according to the model, a unit change in the relative tempo changes the log odds of the two responses by the corresponding coefficient). Furthermore, β_V was significantly larger than β_A (based on the non-overlapping 95% confidence intervals, see Table 1), which confirms that in general the visual tempo contributed more than the auditory tempo.
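For a chi-square test with one degree of freedom, the φ effect size follows directly from the test statistic and the number of observations, φ = √(χ²/N). The short sketch below recomputes φ from the statistics reported above; the function name `phi_from_chi_square` and the dictionary layout are illustrative choices.

```python
import math

def phi_from_chi_square(chi2, n):
    """Phi effect size for a 1-df chi-square test: sqrt(chi2 / N)."""
    return math.sqrt(chi2 / n)

# (chi-square statistic, N) as reported in the text for each tempo range.
reported = {
    "fast": (81.96, 3878),
    "slow": (59.16, 3907),
    "medium": (0.07, 3879),
}

phi = {label: phi_from_chi_square(chi2, n) for label, (chi2, n) in reported.items()}
for label, value in phi.items():
    print(f"{label}: phi = {value:.3f}")
```

After rounding, these values agree with the reported effect sizes of 0.15, 0.12, and 0.004.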
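To make the two-dimensional psychometric function and its separation boundary concrete, the sketch below evaluates a logistic model of the 'fast' response and solves for the visual tempo at which P('fast') = 0.5 for a given auditory tempo. It assumes a purely additive model (intercept plus one coefficient per modality, no interaction term), which may differ from the fitted model; the coefficient values `B0`, `B_A`, and `B_V` are hypothetical and are not the estimates in Table 1.

```python
import math

MEDIAN_BPM = 120.0  # tempi are standardised by dividing by the median tempo

def p_fast(auditory_bpm, visual_bpm, b0, b_a, b_v):
    """Probability of a 'fast' response under an additive 2D logistic model."""
    a = auditory_bpm / MEDIAN_BPM
    v = visual_bpm / MEDIAN_BPM
    return 1.0 / (1.0 + math.exp(-(b0 + b_a * a + b_v * v)))

def pse_visual_bpm(auditory_bpm, b0, b_a, b_v):
    """Visual tempo (BPM) at which P('fast') = 0.5 for a given auditory tempo.

    Setting the linear predictor b0 + b_a*a + b_v*v to zero gives the
    separation boundary v = -(b0 + b_a*a) / b_v.
    """
    a = auditory_bpm / MEDIAN_BPM
    return -(b0 + b_a * a) / b_v * MEDIAN_BPM

# Hypothetical coefficients chosen for illustration only (visual > auditory).
B0, B_A, B_V = -8.0, 3.0, 5.0

for auditory in (60, 120, 180):
    boundary = pse_visual_bpm(auditory, B0, B_A, B_V)
    print(auditory, round(boundary, 1), round(p_fast(auditory, boundary, B0, B_A, B_V), 3))
```

In this parameterisation the boundary has slope −β_A/β_V in the standardised tempo plane, so a slope shallower than −1 corresponds to a larger visual than auditory weight.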

Individual Modality Reliance
Based on the separation boundary, we then categorised participants into one of the following types: vision-, audition-, or bimodal-reliant. We used the log-ratio between the auditory and visual coefficients, log(|β_A/β_V|): a log-ratio between −0.05 and 0.05 was regarded as equal reliance. A log-ratio higher than 0.05 suggests auditory reliance, while a log-ratio lower than −0.05 indicates visual reliance. Figure 4 shows examples of participants for the three types of modality reliance.
According to the categorisation, 16 participants were vision-reliant, seven audition-reliant, and one bimodal-reliant. A chi-square goodness-of-fit test indicated a significant difference among the three groups, χ²(2, N = 24) = 14.25, p < 0.001. That is to say, a larger proportion of the sample favoured visual information when it came to tempo judgement, regardless of the auditory tempo (Fig. 5). This finding again supports our hypothesis that the visual tempo, when presented as natural human movements, has higher priority than auditory tempo.
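The categorisation rule and the group comparison can be sketched as follows. The use of the natural logarithm is an assumption, since the base of the log-ratio is not stated, and the function names are illustrative. The chi-square statistic is computed against equal expected counts across the three groups, which reproduces the reported value; for two degrees of freedom the p-value has the closed form exp(−χ²/2).

```python
import math

def classify_reliance(beta_a, beta_v, threshold=0.05):
    """Classify a participant from the log-ratio of auditory to visual weights.

    Assumes the natural logarithm; the source does not state the base.
    """
    log_ratio = math.log(abs(beta_a / beta_v))
    if log_ratio > threshold:
        return "audition"
    if log_ratio < -threshold:
        return "vision"
    return "bimodal"

def chi_square_equal_split(counts):
    """Chi-square statistic against equal expected counts in every category."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

# Reported group sizes: vision-, audition-, and bimodal-reliant participants.
counts = [16, 7, 1]
chi2 = chi_square_equal_split(counts)
p_value = math.exp(-chi2 / 2)  # chi-square survival function for df = 2

print(classify_reliance(3.0, 5.0))  # hypothetical coefficients
print(chi2, p_value)
```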

Naturalness and Audiovisual Discrepancy
A Pearson's correlation between the overall naturalness rating, ranging from 0 (least natural) to 100 (very natural), and the absolute values of the audiovisual tempo discrepancy suggested that the smaller the discrepancy between the audio and video tempo, the more natural a stimulus was perceived to be, r(81) = −0.56, p < 0.01 (see Fig. 6).
A two-way ANOVA was conducted to examine the effect of auditory and visual tempo on perceived naturalness. Again, tempo ranges were categorised into slow (60 to 90 BPM), medium (105 to 135 BPM), and fast (150 to 180 BPM) for the analysis. Simple main effects suggested that fast visual tempo led to significantly higher naturalness, F(2, 1935) = 100.38, p < 0.001. A statistically significant interaction between auditory and visual tempo on perceived naturalness was found, F(4, 1935) = 37.74, p < 0.001. Tukey's HSD post-hoc tests revealed that, for auditory tempo, no statistically significant differences were observed between different tempi. However, for visual tempo, fast stimuli were associated with higher naturalness ratings than the medium (p < 0.001, d = 0.24) and the slow ones (p < 0.001, d = 0.74). The medium-speed visuals were rated more natural than the slow ones (p < 0.001, d = 0.50), regardless of the auditory tempo.

Discussion
The present study examined the role of audition and vision in tempo judgements of naturalistic stimuli of biological motion, when the tempi of the two modalities are not consistent. First, the tempo of visual information (here the PLD stimuli) affected overall tempo judgements to a greater extent than that of the auditory information (drumbeats). Secondly, a higher proportion of the participants relied on visual rather than on auditory information for tempo judgement. The modality weightings exhibited by individual participants again support our hypothesis that visual information, when presented as biological motion PLDs, possesses high ecological validity and consequently serves as the dominant tempo reference. Finally, a larger audiovisual tempo discrepancy led to lower perceived naturalness.
The results are consistent with our main hypothesis in the sense that naturalistic visual input dominated overall tempo judgement. Past studies with visual movements of varying complexities have observed similar effects where ecological validity could be derived from the stimuli. For abstract movements, the 'visual driving effect' has been found for both rhythm perception (Su & Jonikaitis, 2011) and duration estimation (Van Wassenhove et al., 2008). Su and Jonikaitis (2011) revealed that changes in the moving speed of dots or in luminance provided [...]
Figure caption: The x-axis stands for auditory tempo, and the y-axis for visual tempo. The heatmap represents the proportion of 'fast' judgements, with lighter colour for a higher proportion of 'fast' judgements. The yellow lines stand for the audiovisual tempi at which participants were equally likely to respond 'fast' or 'slow'.
X. Wang et al. / Timing & Time Perception (2021)
Other experiments that have adopted more complex audiovisual stimuli than abstract rhythmic sequences have successfully replicated the visual driving effect too. For example, participants were equally accurate in their discrimination of complex rhythmic patterns for both auditory and visual presentation (Grahn, 2012). Further attempts have taken biological movements into account. When watching vigorous dance movements, the music tempo was perceived as faster compared to the conditions where only music, or music and relaxed dance movements were presented (London et al., 2016). Our study also used point-light dance-like movements, which appeared to entrain the overall perceived tempo toward the visual tempo. This suggests that the preferred modality for tempo judgement may not entirely depend on the precision of the modality (i.e., the modality precision account). Rather, it might depend on how well information of this modality could assist with action prediction (i.e., the modality appropriateness account). As discussed earlier in the Introduction, the reliability of the sensory modality determines its contribution weight in the overall judgement (Hartcher-O'Brien et al., 2014; Shi et al., 2013). The more reliable the prediction from the signal, the higher the weight that would be assigned to that modality. Hence, our findings may suggest that predictable biological motions relative to drumbeats may offer reliable cues for temporal judgement.
This in particular is in line with human action prediction. Various studies have suggested that visual attention driven by the action goal was accompanied by higher processing efficiency (Decroix & Kalénine, 2019; Loucks & Pechey, 2016). Loucks and Nagel (2018) found more accurate tempo discrimination performance with human actions compared to non-human actions. In this vein, higher temporal sensitivity with goal-directed biological motions may also be reflected in the current study, where repetitive dance-like movements became highly predictable and were thus given more weight in tempo judgement. Compared to drumbeats, the biological movements provide more timing information than discrete bursts of sound. In other words, the continuous nature of the visual information may have provided a more reliable source of tempo information. When both modalities possess information of similar continuities, there might not be an advantageous modality in tempo judgement. In this regard, one of the earlier studies by Boltz (2005) found no modality effect on duration reproduction performances with continuous, natural human behaviours including sports or conversations, presented in either the auditory, visual, or audiovisual channel. However, it is as yet inconclusive whether the continuity or the biological plausibility affects the role of vision in temporal processing. To disentangle the effects of the two features, an earlier study examined the efficiency of facilitating audiomotor synchronisation with continuous and/or direction-compatible motions that were either abstract or biological trajectories. Higher synchronisation rates were observed with continuous, direction-compatible, but not necessarily biological motions. It can be speculated that continuous stimuli contribute more temporal information than discontinuous stimuli, regardless of forms (abstract vs biological) and modalities.
In the same vein, the continuity of visual and auditory rhythms has direct effects on participants' timing performances: The sensory modality with continuous inputs was assigned more weight than the discontinuous one (Varlet et al., 2012). Similarly, compared with the discrete bursts of sounds, the continuous biological motions in our study might provide more reliable information in the overall tempo judgements.
It should be noted that the audiovisual source locations might also contribute to the weight in audiovisual judgements. In a study of multisensory simultaneity (Di Luca, Machulla, & Ernst, 2009), it has been shown that in the headphonebased relative to the co-location audiovisual presentation, the auditory estimates are likely to be biased as they are trusted less. However, the contribution of this spatial discrepancy to the visual-dominant temporal judgements, if any, is likely very mild, given that the headphone-based presentation potentially reduced the interference of other external sounds, which would potentially boost the reliability of the auditory modality. In a similar vein, visual and tactile stimuli appearing in different spatial location were associated with less accurate discrimination responses than those in the same location , indicating that the interference from one modality might pose a threat to the credibility of the other. In the current study, the auditory source that was closer to the participants should have provided a more reliable source of temporal information than the visual displays, yet failed to do so.
The discrepancy between auditory and visual tempi in our study was reflected by the naturalness ratings. The current results suggest that high perceived naturalness could be particularly derived from a fast visual tempo, as well as from high audiovisual temporal congruence. Not surprisingly, the smaller the audiovisual temporal discrepancy, the more natural the stimulus was perceived to be. Stimuli with small or no discrepancies presumably posed the least difficulty in binding multisensory inputs to one (Vatakis & Spence, 2008). Our results align well with past findings in that meaningful visual motion, especially when following an expected movement direction, has a strong impact on timing and, as shown in other research, sensorimotor synchronisation . This finding was supported by research comparing the effect of biological movements (finger tapping) with abstract visual stimuli (flashes) on timing accuracy, which found higher stability when synchronising with finger movements than with flashes (Hove & Keller, 2010).
There is scarce neurobiological evidence supporting the role of visual input compared to auditory input in temporal processing. Evidence for the latter including beat detection and time estimation, however, can be found mostly in studies using abstract unimodal stimuli (for a review, see Buonomano & Maass, 2009). An fMRI study where participants were asked to discriminate multisensory inputs (visual, auditory, and tactile) revealed that the auditory dorsal pathway was partly specific to beat processing and functioned as a supra-modal network (Araneda et al., 2017). In Kanai and colleagues' (2011) study, Transcranial Magnetic Stimulation (TMS) disrupted the activities in the auditory cortex and consequently impeded the participants' performances in a two-alternative-force choice task where two durations, presented either in pure tones or visual flickers, were compared. By contrast, disrupting the activities in the primary visual cortex only affected the performances of duration judgements with visual stimuli, suggesting the dominant role of the auditory cortex in temporal processing. The evidence above, nevertheless, may not generalise to neural mechanisms of timing with naturalistic stimuli. Biological motion carries spatiotemporal information that helps in forming action predictions, along with the timing of the action. The findings with behavioural data in the current study call for future neurobiological research on the visual dominance with meaningful visual stimuli such as continuous movements.
Furthermore, attentional processes might also contribute to the 'visual driving' effect observed in our study. Visual dominance in spatial attention has been supported by numerous studies. In an audiovisual context, the visual modality was associated with faster reaction times and fewer response errors in modality-switching, spatially-incongruent tasks (Lukas, Philipp, & Koch, 2010) and was detected with greater sensitivity (Spence et al., 2012). The Colavita effect, more specifically, refers to a phenomenon in which visual stimuli are associated with higher salience than auditory stimuli when both appear simultaneously (Colavita, 1974). The perception of human biological motion in point-light displays led to even higher visual salience than abstract visual displays (Johansson, 1973). Selective attention oriented towards biological movement, particularly motion with a purpose compared to scrambled motion, has been shown to activate the part of the motor cortex associated with action mirroring (Gao et al., 2014). The 'imagined' imitation led to determination of action intention (Knoblich & Sebanz, 2008), in this case prediction of motion trajectories and their spatiotemporal information. The allocation of attention to natural motion (the PLD in the current study) could then explain participants' reliance on visual tempo. According to the Dynamic Attending Theory (Jones & Boltz, 1989), the pace of the internal clock is subject to the environmental rhythm when limited attentional resources are oriented towards the rhythm of exogenous stimuli. Synchronising one's attentional 'pulses' with environmental rhythms is also known as the entrainment effect; in this case, participants were predominantly under the influence of the visual tempo. The biological motion in the PLD was presumably attended to more often than the drumbeats, therefore dominating the perceiver's tempo judgement.
Both naturalistic biological motion and spatial attention towards biological motion may contribute to the visual driving effect observed in the current study. Yet a few questions remain. Firstly, modifying the ecological validity of the auditory stimuli (e.g., beats vs more complex music) might change overall tempo judgements. Secondly, instead of multisensory gain, it could be investigated whether there is also a multisensory loss, or more specifically, whether the presence of inconsistent multisensory information could impede temporal processing such as timing or duration judgement. Lastly, studies may explore whether other senses such as tactile perception are capable of multisensory integration and what their weights are in the timing process. A few studies have attempted to explore the potential of tactile-assisted metre judgement when auditory or visual stimuli were presented (Araneda et al., 2017; Huang et al., 2012). In a rhythm pattern identification task, for example, congruent vibrations (tactile) and tones (auditory) raised the correct rhythm discrimination rate to 90%, while incongruent inputs resulted in a decline (Huang et al., 2012). Interestingly, the correct rate was significantly higher when the dominant (correct) pattern was presented with sounds than with vibrations, again confirming the dominant role of audition.
There are a number of limitations that should be addressed in future studies. Firstly, a response bias in the decision process, which might be reflected by participants' preference towards one modality under certain tempo conditions, cannot be fully ruled out. According to the causal inference model (Körding et al., 2007), response bias towards one source of information can be observed when the key feature (tempo) differs between two sources (sensory modalities) to an extreme extent. However, if such a response bias widely existed among the sample population, the stimuli with a large audiovisual gap, regardless of the modality of the fast tempo, should be judged to be 'fast' equally often, which would be reflected by an (inverted) U-shaped threshold of equality in Fig. 3. Secondly, considering the essential role of naturalistic stimuli, the perceived predictability of the auditory beats and the PLD has not been quantitatively pre-evaluated. For visual stimuli, the predictability should take into account direction compatibility as in Hove and colleagues' (2010) work. In future studies, a baseline test prior to the experiment collecting familiarity and naturalness ratings, as well as eye fixation concerning the compatibility of motion, should be considered. The imbalance between the two modalities can be minimised by collecting the perceived naturalness of the auditory and visual stimuli separately from multiple independent raters before the experiment commences. As for auditory stimuli, the predictability is frequently measured by sensorimotor synchronisation accuracy in tapping tasks (e.g., Stupacher et al., 2017). Furthermore, past studies tended to require participants to judge the temporal information of one modality only (e.g., Klink et al., 2011). The advantages and disadvantages of an experimental paradigm that allows participants the liberty to exhibit their modality reliance should be further examined.
In addition, the current study did not systematically evaluate differences between auditory and visual attention. Future studies should seek to disentangle the effects of multisensory attention, especially visual attention, from the effect of naturalistic stimuli on temporal judgements. To verify the link between a visual driving effect and direction compatibility, future studies should also consider a control condition in which inverted biological movements are presented.
Taken together, the current study provides evidence for a visual driving effect in multisensory tempo judgements with meaningful movements. On the group level, visual tempo contributed more to the overall tempo judgement than auditory tempo. On the individual level, in addition, when presented with tempo-inconsistent audiovisual stimuli, more participants relied on the visual tempo to make the overall tempo judgements. The modality reliance provided insights into the tempo judgement strategies adopted by different individuals. Future studies should further investigate the apparent dominance of visual information in timing with real-life audiovisual scenes, as well as the factors influencing individuals' modality reliance in temporal judgements.